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Scaling Multidimensional Inference for 
Structured Gaussian Processes 

Elad Gilboa, Yunus Saatci, and John P. Cunningham 

Abstract — Exact Gaussian Process (GP) regression has 0(N a ) runtime for data size N, making it intractable for large N. Many 
algorithms for improving GP scaling approximate the covariance with lower rank matrices. Other work has exploited structure 
inherent in particular covariance functions, including GPs with implied Markov structure, and equispaced inputs (both enable O(N) 
runtime). However, these GP advances have not been extended to the multidimensional input setting, despite the preponderance 
of multidimensional applications. This paper introduces and tests novel extensions of structured GPs to multidimensional inputs. We 
present new methods for additive GPs, showing a novel connection between the classic backfitting method and the Bayesian framework. 
To achieve optimal accuracy-complexity tradeoff, we extend this model with a novel variant of projection pursuit regression. Our primary 
result - projection pursuit Gaussian Process Regression - shows orders of magnitude speedup while preserving high accuracy. The 
natural second and third steps include non-Gaussian observations and higher dimensional equispaced grid methods. We introduce 
novel techniques to address both of these necessary directions. We thoroughly illustrate the power of these three advances on several 
datasets, achieving close performance to the naive Full GP at orders of magnitude less cost. 

Index Terms — Gaussian Processes, Backfitting, Projection-Pursuit Regression, Kronecker matrices. 
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1 Introduction 

Gaussian Processes (GP) have become a popular tool for 
nonparametric Bayesian regression. Naive GP regression 
has 0(N 3 ) runtime (matrix inversions and determinants) 
and 0(N 2 ) memory complexity, where N is the number 
of observations. At ten thousand or more, this problem 
is for all practical purposes intractable, given current 
hardware. 

A significant amount of research has gone into 
sparse approximations (reducing run-time complexity to 
0(Ad 2 N) for some M -c N). For an excellent review of 
sparse GP approximations, see [1J. All sparse approxima- 
tion methods are based on the assumption of conditional 
independence of the training and test sets, given an 
active set of inducing inputs. As emphasized in [1], the 
results of these algorithms can depend strongly on the 
properties of the data. Since different assumptions fit 
different datasets, and since sparsity has by no means 
solved all efficiency issues for GPs, it is imperative to 
explore alternative avenues for attaining scalability. 

The central aim of this paper is to introduce struc- 
tured GPs for multidimensional inputs. Specifically we 
present three novel advances which allow efficient and 
sometimes exact inference, or at least a superior runtime- 
accuracy tradeoff than existing methods. We say a GP is 
structured if its marginals p(f |X, 6) contain exploitable 
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structure that enables reduction in computational com- 
plexity. While these structured GP methods are known 
in the case of scalar inputs, many regression applications 
involve multivariate inputs. Our main contribution is 
three nontrivial extensions of these algorithms to deal 
with this case. 



1.1 Gaussian Process Regression 

In brief, GP regression is a Bayesian method for non- 
parametric regression, where a prior distribution over 
continuous functions is specified via a Gaussian process, 
(the use of GP in machine learning is well described in 
0). 

A GP is a distribution on / over an input space X such 
that any finite selection of input locations xi, . . . , xjy € X 
gives rise to a multivariate Gaussian density over the 
associated targets, i.e., 
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where mjv = m(xi, . . . , xn) is the mean vector and 
Kw = {k(xi,Xj)}ij is the covariance matrix for mean 
function m and covariance function k. In this paper we 
are specifically interested in the basic equations for GP 
regression, which involve two steps. First, for given data 
y (making the standard assumption of zero-mean data, 
without loss of generality), we calculate the predictive 
mean and covariance at M unseen inputs as: 
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For model selection, since the function k(-,-;9) is pa- 
rameterized by hyperparameters such as amplitude and 
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lengthscale (which we group into 9), we must consider 
the log marginal likelihood Z(9): 



log Z{6) = --(y T (K N +all 



N) 



+ log(det((K A r + o*I N ))) + Nlog(2ir)) .(4) 

Here we use this marginal likelihood to optimize over 
the hyperparameters in the usual way [2J. The runtime 
of GP regression and hyperparameter learning is 0(N 3 ) 
due to (Kjv + ofilw) / which is present in all equations. 



1 .2 Gauss-Markov Processes 

We briefly review the use of Gauss-Markov Processes 
for efficient GP regression on scalar inputs, as a starting 
point for the multidimensional extensions in Section |2] 
Although the Gauss-Markov Process are well studied, 
their use for exact and efficient GP regression is under- 
appreciated. A GP with a kernel corresponding to a state- 
space model can be viewed as a Gauss-Markov Process, 
enabling linear runtime. Gauss-Markov Processes can be 
viewed as the solution of an order-?7i linear, stationary 
stochastic differential equation (SDE), given by: 



d m f{x) 



dx r 



df(x) , , 

-ai— ha /(x) 

dx 



+ Qm-i - ; dr t~ H hOi - +a a f(x) = w(x), 

(5) 

where w(x) is a zero-mean white noise process. Note 
that x can be any scalar input, including time. Because 
w(x) is a GP and / is linear in its coefficients, / is also 
a GP. See [3| for an excellent introduction to SDEs. We 
can rewrite Eq. (0 as a vector Markov process: 
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dx 



Az(x) + Lw(x), 
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and where L = [0,0,..., 1], and A is the usual suitable 
coefficient matrix. Eq. (J6) shows that, given knowledge 
of f(x) and its first m derivatives, we have Markov 
structure in the graph underlying GP inference, which 
will enable all efficiency gains in this section. 

Earlier work [4[, [5|, derived the SDEs corresponding 
to several commonly used covariance functions includ- 
ing the Matern family and spline kernels, and good 
approximate SDEs corresponding to the exponentiated- 
quadratic kernel. Once the SDE is known, the Kalman 
filtering [6 J and Rauch-Tung-Striebel (RTS) [7| smoothing 
algorithms (which correspond to belief propagation) can 
be used to perform GP regression in 0{N) time and 
memory, a noteworthy leap in efficiency. Note that the 
Gauss-Markov Process framework requires sorted input 
points. Thus, if a sorting step is required to preprocess 
the data, the runtime complexity will be 0(A log iV). 
However, as this is not always relevant and certainly not 
the focus of the algorithms presented here, we assume 
the inputs are sorted in advance and refer to these 



models as 0{N) in runtime complexity. For this paper 
we will summarize the usual system equations as: 

Initial state : p{z{x x )) = Af(z(x 1 ); V x ), (8) 
State update : p(z(xi)|z(xi_i)) 

= jv , (a(x,);# i _ia(x i _i),Q i _ 1 ), (9) 
Emission : p{y{xi)\z(xi)) 

= N{y{x t )-h T z{x l ),a 2 n ), (10) 

where we assume that the inputs x$ are sorted in as- 
cending order, and the system matrices <f> and Q are 
functions of the original GP's hyperparameters. Using 
these update and emission equations in the standard 
Kalman or RTS framework allows exact regression over 
the Gauss-Markov Process for a single-input dimension. 

2 Structured GP on Multiple Input Di- 
mensions 

Despite its importance for a variety of applications, 
tractable extension of the state-space models (Section 
11.2b to higher-dimensional input spaces has not been 
addressed in the literature. Doing so involves a number 
of novel steps and is the primary contribution of this 
work. We introduce three new algorithms for struc- 
tured GPs over multivariate inputs, namely: (Sec. I2.1JI 
additive multidimensional regression (with extension to 
additive covariates), (Sec. 12.21 1 non-Gaussian likelihood 
extensions, and (Sec. 12.31 1 GPs over a multidimensional 
grid. Full details of the development of these algorithmic 
extensions, including important proofs, can be found in 
our supporting work 0. 

2.1 GP Regression for Multidimensional State- 
Space Models 

For the purposes of extending one-dimensional Gauss- 
Markov Processes (Sec. \1.2) to multiple dimensions, we 
use the assumption of additivity. The optimal accuracy- 
efficiency tradeoff will be presented in Section l2.1.3l Here 
we introduce the building blocks. The resulting model 
regresses a sum of D Gauss-Markov Processes (which 
are independent a priori), where D > 1 is the dimen- 
sionality of the input space. Additive GP regression can 
be described using the following generative model: 



I), 



= J2 f d(X itd ) + e i=l,...,N, 



(11) 



d=l 



fd{-)~gV(0;k d (x d ,x.' d ;6 d )) d = l,...,D, (12) 
e~Af(0,^), (13) 

where X^ d is the d-th component of input i, k d (-,-) is 
the kernel of the scalar GP along dimension d, 9 d repre- 
sent the dimension-specific hyperparameters, and a\ is 
the (global) noise hyperparameter. Although interactions 
between input dimensions are not modeled a priori, an 
additive model does offer interpretable results - one can 
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simply plot the posterior mean of the individual to 
visualize how each predictor relates to the target [8[. 

We introduce a novel multidimensional Gauss-Markov 
Process regression where the underlying algorithm can 
be viewed as a Bayesian interpretation of the classical 
backfitting method [9[, [10|. As described in ITTT1 , a 
nonparametric regression technique (such as the spline 
smoother) which allows a scalable fit over a scalar input 
space can be used to fit an additive model over a D- 
dimensional space with the same overall asymptotic 
complexity by means of the backfitting algorithm. 

Surprisingly the application of backfitting (Algorithm 
H) can be proved to converge to the exact posterior mean. 
The easiest way to see this is by viewing (Algorithm 
H) as a Gauss-Seidel iteration. We detail this in-depth 
proof in our supporting work 1 5 ] . As a reminder, Gauss- 
Seidel is an iterative technique to solve linear systems, 
in this case solving for the exact posterior mean (see 
|[T2| for a great Gauss-Seidel reference). It is precisely 
the additive Gauss-Markov Process structure that makes 
the backfitting update equivalent to a Gaussi-Seidel step. 



Algorithm 1: Efficient Computation of Additive GP 
Posterior Mean via Backfitting 

inputs : Training data {X,y}. Suitable covariance 

function. Hypers 9 = UdLii^d} u a n- 
outputs: Posterior training means: Yld=i f^d' wnere 

1 Zero-mean the targets y; 

2 Initialise the fi d (e.g. to 0); 

3 while The change in fi d is above a threshold do 

4 for d = 1, . . . , D do 

5 M d ^E(f d |y-X^ d /t^,X: jd ,0d,cr2); > use 
state-space model here. 

6 end 

7 end 



To calculate posterior variances and learn hyperpa- 
rameters, we must investigate further. We can express 
the underlying graphical model as in Fig. [TJ where we 
have made the state-space representation of each scalar 
Gauss Markov process explicit. The observed variables 
are the targets y, and the latent variables Z consist of 
the D Markov chains: 

Z = I Z \ , . . ■ , zj^ , z\ ; * ** ; i ■ ■ ■ , Zj), ■ • ■ , Zj J • (14) 
\ =Zi =7*2 =Zd J 

Unfortunately, the true posterior p(Z±, . . . , Zu|y, X, 9) 
is hard to handle computationally because all variables 
Zi are coupled in the posterior. Although everything is 
still Gaussian, we are no longer able to use the efficient 
state-space methods of Section 11.21 returning us to the 
original computational intractability at large N. Thus, 
we require an approximate inference technique such as 
variational Bayesian expectation maximization (VBEM) 




Fig. 1: Graphical model for efficient additive GP regression. 
Each dimension is written in its corresponding state-space 
model. 



or Markov Chain Monte Carlo (MCMC) [13|. We now 
briefly introduce our use of these well-known tech- 
nologies, as the details will demonstrate the important 
connection to the backfitting algorithm. Note, that the 
main benefits of using these algorithms comes from their 
scalability as they are able to inherit the linear time 
complexity of the state-space model. 

2.1.1 Variational-Bayesian Expectation Maximization 

E-Step: We use a variational-Bayesian (VB) approxima- 
tion to the E-step by making the standard assumption 
of an approximate posterior that factorizes across the Zi, 
i.e.: 

D 

g (Z) = JJg(Zi). (15) 

i=i 

Given such a factorized approximation, it can be shown 
that KL(g(Z) |p(Z|y, 9)) can be minimized in an iterative 
fashion, using the following central update rule 1131 : 

log g (Z 3 -) = Eiyy(logp(y,Z|0)) + const. (16) 

where E^(-) is an expectation with respect to 
g(Zi). Using Eqs. (15) and we derive the it- 
erative updates required for VBEM. We first write down 
the log joint over all variables, given by: 

N / D \ 

log(p(y, Z\9)) = ]T \ogp y n \h T £ z** (n) , ^ + 

n=l \ d=l I 

D N 

^T^Tlog^lz*- 1 ,^), (17) 

d=l t=l 

where we have defined p(z t d \z t d ^ 1 , 9 d ) = p{z d \9d), for 
t = 1, and h T z gives the first element of z. Note that 
it is also necessary to define the mapping t d (-) which 
gives, for each dimension d, the state-space model index 
associated with y n . The index t iterates over the sorted 
input locations along axis d. Because the expectation of 
the right hand side of Eq. ((T7t does not depend on z--, 
we can consider that the mean of the Gaussian in the 
first term of Eq. (I7l l, allowing us to write: 
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(n) 



t=l 

where 



^io g p(zji z r i .^')+ const - 

[zf] = | ztq^dZi. 



, T tj(n) 2 

;h z/ v ',<r_ 



(18) 



(19) 



A key and somewhat surprising outcome of Eq. l(T8l is 
that in order to update the factor q(Zj) in the E step, 
it is sufficient to run the standard state-space model 
inference procedure using only the pseudo-observations: 



E 



Vn - h T - 
A number of conclusions can be drawn from this 
connection. First, since VB iterations are guaranteed to 
converge, any moment computed using the factors q(Zi) 
is also guaranteed to converge. Convergence of these 
moments is important because they are used to learn 
the hyperparameters. Second, since the true posterior 
p(Zi, . . . ,Zj)\y,0) is a large joint Gaussian over all the 
latent variables, E g (Z) will be equal to the true posterior 
mean. This is because the true posterior is Gaussian 
(unimodal with the mean as its mode) and the VB 
approximation is mode-seeking fl4l . Additionally, as is 
typical for variational methods, the posterior covariance 
will be underestimated because KL(g(Z)||p(Z|y, 9)) is an 
exclusive divergence measure fl4|. 

The central VB update is precisely a backfitting up- 
date, thus illustrating a novel connection between ap- 
proximate Bayesian inference for additive models and 
classical estimation techniques. Furthermore, this pro- 
vides an alternative proof of why backfitting computes 
exact posterior means over latent function values. 

M-Step: We must optimize E g (logp(y, Z\6)) over 9. 
Using Eq. (|l7t it is easy to show that the expected 
sufficient statistics required to compute derivatives with 
respect to 9 are the set of expected sufficient statistics 
for the state-space model associated with each individual 
dimension. This separability is another major advantage 
of using the factorized approximation to the poste- 
rior. Thus, for every dimension d, we use the Kalman 
filter and RTS smoother to compute {E 9 ( Zd )( z d)}„_ 1 / 

{V, ( z 4 )(^)d and {^q(z d )( z d z d +1 )}n=i ' We then use 
these expected statistics to compute derivatives of the 
expected complete data log-likelihood with respect to 
9 and use a standard minimizer (we use a conjugate 
gradient method) to complete the M step. 



2. 1.2 Markov Chain Monte Carlo (MCMC) 

An important and customary comparison to VB is 
MCMC, which carries the usual benefits of approximate 
hyperparameter integration, but at a reduced efficiency 



1 15]. Here we briefly discuss our fairly standard MCMC 
implementation, noting only the important differences. 

As in standard MCMC, we extended the model to 
include a prior over the hyperparameters. The hyper- 
parameters for each univariate function are given 
a prior parameterized by vi, ct T , (3 T }, where {ni,vi} 
correspond to the covariance function hyperparameter I 
and {a r ,/3 T } to r^. We also place a r(a„,/3„) prior over 
the noise precision hyperparameter r„. We run Gibbs 
sampling where we block-sample the latent chains. The 
algorithm used to sample from the latent Markov chain 
in a state-space model has been called the forward- 
filtering, backward sampling algorithm, where forward 
filtering is followed by a backward sampling from the 
conditionals p(z k \z s k a J ^ ple ; y; X :jd ; 9 d ) [16]. The sampling 
is initialized by sampling from p(z^-|y; X :j( j; 9d), which 
is computed in the final step of the forward filtering 
run, to produce z s £ mple . The forward-filtering, backward 
sampling algorithm generates a sample of the entire state 
vector jointly (over training and test input locations). 

2. 1.3 Efficient Projected Additive GP Regression 

So far, we have shown how the assumption of additivity 
can be exploited to derive non-sparse GP regression 
algorithms which scale as 0(N). These considerable 
efficiency gains can however decrease accuracy and 
predictive power versus a full unstructured GP, due to 
the limited expressivity of the simple additive model. 
To address this, we now demonstrate a relaxation of 
the additivity assumption without sacrificing the 0(N) 
scaling, by considering an additive GP regression model 
in a feature space linearly related to original space of 
covariates fl7|. [TBI . We call this algorithm projection 
pursuit Gaussian Process regression (PPGPR). 

We show that learning and inference for such a model 
can be performed by using projection pursuit GP regres- 
sion, a novel fusion of the classical projection pursuit 
regression algorithm with GP regression, with no change 
to computational complexity. The graphical model illus- 
trating this idea is given in Figure |2] We refer to the 
following projected additive GP prior: 

M 

y, = f >»(^m«) + e i=l,...,N, (20) 



m— 1 



fm(-) ~ QV (0; k m (d> m , (j)' m ; 9 m )) 



(21) 
.,M, (22) 



Notice that the number of projections, M, can be less 
or greater than D. Forming linear combinations of the 
inputs before feeding them into an additive GP model 
significantly enriches the flexibility of the functions sup- 
ported by the prior above, including many terms which 
are formed by taking products of covariates, and thus 
can capture relationships where the covariates jointly 
affect the target variable. In fact, Eqs. d20b through to 
(|22l l are identical to the standard neural network model 
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where the nonlinear activation functions are modeled 
using GPs. 




Fig. 2: Graphical model for Projected Additive GP Regression. 
In general, M D. We present a greedy algorithm to select 
M, and jointly optimize W and {# m }^ =1 . 

For inference and learning, we now derive a novel 
greedy algorithm which is similar to another classical 
nonparametric regression technique known as projection 
pursuit regression [19]. 

Consider the case where M = 1. In this case, the result- 
ing projected additive GP regression model reduces to a 
scalar GP with inputs given by Xw. Recall from Section 
11.21 that, for a kernel that can be represented as a state- 
space model, we can use the EM algorithm to optimize 
9 with respect to the marginal likelihood efficiently, for 
some fixed w. It is possible to extend this idea and jointly 
optimize w and 9 with respect to the marginal likelihood, 
although we opt to optimize the marginal likelihood 
directly. Notice that every step of this optimization scales 
as O(N), since at every step we need to compute the 
marginal likelihood of a scalar GP (and its derivatives). 
These quantities are computed using the Kalman filter by 
differentiating the Kalman filtering process with respect 
to w and 9. All that is required is the derivatives of the 
state transition and process covariance matrices (<I>t and 
Q t , for t = 1, . . . , N — 1) with respect to w and 9. 

We now handle the case where M > 1 using a greedy 
approach. At each iteration we find the optimal pro- 
jection weights w. The greedy nature of the algorithm 
allows the learning of the dimensionality of the feature 
space, M, rather naturally - one keeps on adding new 
feature dimensions until there is no significant change in 
performance (e.g., normalized mean-squared error). One 
important issue which arises involves the initialization 
of w m at step to. In our simulations we chose to initialize 
the weights as those obtained from a linear regression of 
X onto the target /residual vector y rn . This method acts 
as an educated guess expecting a faster convergence rate. 
We call this algorithm, which learns W and 9, projection 
pursuit GP regression (PPGPR). To be clear, note that 
PPGPR is used for the purposes of learning W and 9 
only. Once this is complete, one can simply run the VB E- 
step to compute predictions. The PPGPR algorithm offers 



a bridge between the flexibility and elegance of the naive 
full GP, and the efficiency of its approximate additive 
counterpart. Its strength lies not only in the connection 
to backfitting, but also in its substantial improvement in 
performance and efficiency, as will be shown in Section 

El 

2. 1.4 Parallelization of State-Space Models 

The above algorithms can be parallelized to achieve 
even more speed up. By transforming the full GP to a 
state-space model formulation, we replaced calculating 
an inverse of a large joint covariance matrix to that 
of manipulating much smaller evolution and emission 
matrices (3?, Q), for all input locations. As these matrices 
are functions of the hyperparameters, they must be re- 
calculated in every iteration during the hyperparameters 
learning stage. Notice, however, that for a fix set of 
hyperparameters, the values of «&, and Q for all locations 
are independent, and hence can be precalculated in 
parallel. We used a very simple parallelization scheme 
across (up to) 8 worker threads. We will further discuss 
the gains this in Section |3~T1 It is important to note that 
as the speed of the CPUs has come to a halt and the 
number of cores is on the rise, the ability to use parallel 
schemes will be a must for any efficient GP algorithm in 
the future. 

2.2 Generalized Additive GP Regression 

So far we have shown new methods for scaling mul- 
tidimensional GP regression. Here, we extend the effi- 
ciency of these methods to include problems that need 
non-Gaussian likelihood functions, such as classification. 
Being able to efficiently handle large multidimensional 
datasets of non-Gaussian distributed targets is especially 
important in classification problems, which regularly 
have multiple input features, and a discrete label. Infer- 
ence and learning with non-Gaussian emissions neces- 
sitates the use of approximations to the true posterior, 
e.g., expectation propagation (EP) [20], and Laplace ap- 
proximation (LA) [2J. However, only a handful of works 
in the literature focus on tractable algorithms for big 
data where exact GP classification is intractable. These 
works are extensions of the sparse GP framework using 
various approximation methods [21 [, [22 1, [23]. Sparse 
GP classification, however, is not without problems: for 
example, pseudo input learning with EP and hyperpa- 
rameter learning expand the parameter space making 
the problem more susceptible to overfitting. Other works 
which use a subset of data are more stable; however, this 
can affect accuracy as the active subset may not be opti- 
mal for the sparse conditional independence assumption. 
Additionally, there is the problem of model selection for 
the number of points to use in the sparse active set. Thus, 
extending our structured GP model to the non-Gaussian 
case is necessary and useful. 

Here we derive, for additive structured GP kernels, 
an O(N) algorithm which performs MAP inference and 
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hyperparameter learning using Laplace's approximation. 
The resulting LA algorithm is both stable and accurate. 
Note, that although it has been suggested in literature 
that the EP approximation has superior approximation 
qualities (for binary classification [24]), the choice of LA 
is due to its computational benefits as, to our knowledge, 
there is no good method for lowering the computational 
complexity of the EP updates from 0(N 3 ). We now 
derive this LA algorithm, and then we end this section 
with another key insight showing that this algorithm is 
a Bayesian interpretation of another classical technique 
known as local scoring [25). 

Given the likelihood p(y|f) is non-Gaussian, we use 
the standard Laplace approximation: 



Mfly^-A/^A- 1 ), 



(23) 



where f = argmaxf p(f |y, X, 9) and the approximated 
covariance matrix A = — VVlogp(f|y, X, #)| f _j. We de- 
fine the following objective: 



fi(f)=logp(y|f)+logp(f|X,< 



(24) 



f is found by applying Newton's method to this objec- 
tive. Newton's method is guaranteed to converge to the 
global optimum, given the objective in Eq. ((24) is convex 
with respect to f. This is the case for many cases of 
generalized regression, as the likelihood term is usually 
log-concave, as well as the prior term. 

If we assume that f is drawn from an additive GP, 
then it follows that the required gradient and Hessian 
(for Newton iterations) are: 



vo(f) = v f io g p(y|f)-K- 1 d f, 

VVfi(f) = VVf logp(y|f) -K-^. 



(25) 
(26) 



This makes the Newton iteration: 

f (*+i)^ f M + (K ad i d + ^)- 1 

•(v,logp(y|f)| fW -K^fW), 

= K add {K^ + w-y 1 

i (k) +W- 1 V f \ogp(y\{)\ fik) 



(27) 
(28) 



Since the generalized additive GP model assumes condi- 
tional independence of y given f, p(y|f) = n«=i P(Vi\fi)> 
and W is a diagonal matrix and therefore easy to invert. 
Looking closer at Eq. (|28l l, we see that it is precisely 
the same as the expression to compute the posterior 
mean of a GP, where the target vector is given by 

f ( k > + W _1 Vf logp(y|f)| f (fc) and where the diagonal 

"noise" term is given by W _1 . Given an additive kernel 
corresponding to a sum of scalar GPs that can be rep- 
resented using state-space models, we can therefore use 
Algorithm [T] to implement a single iteration of Newton's 
method. As a result, it is possible to compute f in O(N) 
time, since in practice only a handful of Newton itera- 
tions are required for convergence. Wrapping backfitting 



iterations inside a global Newton iteration is precisely 
how the local-scoring algorithm is run to fit a generalized 
additive model [ 25 1 . Thus, we can view the development 
in this section as a novel Bayesian interpretation of local 
scoring. 

The above calculates the posterior Laplace approxima- 
tion. To efficiently approximate the marginal likelihood, 
we use the Taylor expansion of the objective function 
17(F), although we will need to express it explicitly in 
terms of F = [f i; . . . ; fp], as opposed to the sum over f . 



J2(F) =logp(y|F) + logp(F|X,t 



(29) 



Once F is known, it can be used to compute the ap- 
proximation to the marginal likelihood. Using first-order 
Taylor expansion we obtain: 



logp(y|X) 

w Q(F) - i log det (w + K 1 

= logp(y|F)-iF T K _1 F 
1 , , />„ 1 



ND 



log(27r) (30) 



-logdet(K + W J - -log det (W), (31) 

where K is a block diagonal tiling of K!,K 2 , . . . ,Ka 
and W is a block diagonal tiling of the single W matrix. 
We used the matrix determinant lemma to get from Eq. 
d30l to Eq. d3"TT l. Importantly, all the the terms in Eq. <f3T~|) 
can be computed in O(N) runtime due to the fact that 
the latent function can be represented as a sum of state- 
space models (for more details see |5[) . 

In summary, we showed that by using the Laplace 
approximation, we are able to maintain low runtime 
complexity by combining a Newton method with ad- 
ditive regression update in Eq. 1 1281 (local scoring), and 
by approximating the marginal likelihood using Eq. d30b . 

2.3 Gaussian Processes on Multidimensional Grids 

The second GP kernel structure that can be exploited 
involves the assumption of equispaced inputs. This is 
commonly seen for GP regression in time and space (e.g., 
regular measurements at evenly spaced weather stations, 
or video captured by a CCD camera). Even though there 
are good Toeplitz methods for scalar equispaced inputs 
1 26 1, the extension to a multidimensional grid has not 
been addressed in literature. In this section, we present 
a novel method to perform exact inference in 0(N) time 
for any tensor product kernel (most commonly-used 
kernels are of this form), using properties of Kronecker 
products. 

In this case, we can compute all the computation- 
ally troublesome quantities involved (such as (Kjy + 
<7^Ijv) _1 y) using a few matrix-vector products (of size 
N). Importantly, these matrix- vector products have the 
form: 

a=|^AiJb, (32) 
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all of which can be computed in linear runtime. 

Computing a. using standard matrix-vector multipli- 
cation is an 0{N 2 ) operation. However, with problems 
of this form, it is possible to attain linear runtime us- 
ing tensor algebra [27 1. For the relevant background in 
tensor algebra, see the appendix in our supporting work 
|5| . The product in Eq. | |32l l can be viewed as a tensor 
product between the tensor - x iD - D representing 
the outer product over [Ai, . . . , A^], and • . The 

term Tf Bt ...j 1 represents the length- N vector b. Con- 
ceptually, the aim is to compute a contraction over the 
indices ji, ■ ■ ■ ,Jd, namely: 

Gi Gi 

where T" is the tensor representing the solution vector 
a. As the sum in Eq. ll33l is over N = Yld=i Gd elements, 
it will run in 0(N) time. Equivalently, we can express 
this operation as a sequence of matrix-tensor products 
and tensor transpose operations. 



a = vec Ai . . . A D -i A D T 




, (34) 



where we define matrix-tensor products of the form Z = 
XT as: 

sizc(T,l) 

'Z*i\...in ~ ^ . -^iifc1'fei 2 ...i D ■ (35) 
k 

The operator T is assumed to perform a cyclic permu- 
tation of the indices of a tensor, namely 



Y 



(36) 



Furthermore, when implementing the expression in Eq. 
d34l it is possible to represent the tensors involved using 
matrices where the first dimension is retained and all 
other dimensions are collapsed into the second, resulting 
in a matrix B which is a G^-by-TL^ Gj matrix, where 
Gd is the number of elements in dimension d. Algorithm 
|2] gives pseudo-code illustrating these steps. In short, 
we use tensor algebra to create a linear-runtime method 
(Algorithm IJJl for doing matrix-vector multiplications 
across a Kronecker product matrix, which arises quite 
naturally for most GP kernels on a grid of inputs. 

The critical second step is to note that (K + o^Iat) -1 
cannot itself be written as Kronecker product, due to 
the perturbation on the main diagonal. Nevertheless, it 
is possible to sidestep this problem using the eigende- 
composition properties and the identity 



{K + a 2 n I N ) 1 y = Q(A + a 2 n I N ) 1 Q T y. 



(37) 



Importantly, the eigenvector matrix Q will also be a Kro- 
necker product. Hence, to then efficiently solve Eq. 07K 
we first evaluate and perform eigendecompositions of 
covariances along individual dimensions to get [Q d , A^]. 
This has complexity 0((ma,Xd Gd) 3 ), which is negligible 



Algorithm 2: Efficient matrix-vector multiply for Kro- 
necker matrices 

inputs : D matrices [Ai . . . Ad], length- N vector b 

outputs: at, where ol = (&)j =1 AdJ b 

1 x b; 

2 for d <— D to 1 do 

3 G d <- size (A d ) ; 

4 X <— reshape (x, Gd, N/Gd) ', 

5 Z <— ArfX > Matrix-tensor product 

6 Z <— Z T > Tensor rotation 

7 x <— vec (Z) ; 

8 end 
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x; 



Ild=i G d- Next, we calculate Eq. O 



compared to N 
in three steps: 



a <- kron_mvprod([Q^,...,Qj],y) , (38) 



a 
a 



(A + a 2 n I N ) a, 
kron_mvprod([Q 1 , . . . ,Q D ],a) 



(39) 
(40) 



where we efficiently used kron_mvprod (Alg. |2]l twice 
and noting that the matrix A + a 2 In is easy to invert 
as it is diagonal. Computation of the test set predic- 
tions Q (A + ct 2 In) Q T Kj\/jv, can be done efficiently 
using the same approach. The result is the third main 
contribution of this paper: a linear-runtimaS method for 
calculating the key regression equation (K + <t 2 In) y 
using only the assumption that the inputs lie on a grid. 

In summary, we exploited two important realizations: 
efficient eigendecomposition using properties of the Kro- 
necker product, and tensor products enabling fast multi- 
plication by matrices that can be written as a Kronecker 
product. This novel improvement for exact GP inference 
opens the door to a whole new set of applications, which 
would have never been considered otherwise, such as 
GP on images or videos. 

3 Results 

Here we will compare the algorithms discussed in the 
paper to other commonly used algorithms both in the GP 
world, and other common machine learning techniques. 

3.1 Multidimensional Regression 

In this section we will compare methods for multidimen- 
sional regression on both simulated and real experimen- 
tal data. For each experiment presented, we will compare 
both runtime and accuracy. If a particular algorithm has 
a stochastic component to it (e.g., if it involves MCMC) 
its performance will be averaged over 10 runs. Every 

1. The complexity of the algorithm is O ((log D N) 3 Nj, however it 
rapidly converges to O(N) as the number of dimensions grows. 
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experiment was composed of training (i.e., smoothing 
and hyperparameter learning given {X,y}) and testing 
phases. In each experiment, we used 1000 points for test 
sets. 

In terms of accuracy, we use two standard perfor- 
mance measures: normalized mean square error (NMSE) 
and test-set Mean Negative Log Probability (MNLP). 

E£i(y*W-A**0) 2 



NMSE 



MNLP 



1 N * 

2~/v7^ 

i=l 



y) 2 

(y*(*)-M*)) 2 

v*(<) 



+ logv+(i) + log27r 



where ^ = E(f*|X, y, X„, 9), v* = V(f*|X, y, X„ 0), and 
y is the training-set average target value. These measures 
have been chosen to be consistent with those commonly 
used in the sparse GP regression literature. We compare 
runtime performance in seconds, taking into account 
both the learning and prediction phases. 

We test the following algorithms (with the follow- 
ing names): the full naive GP implementation (Full 
GP), additive models (Sections 12.1.11 and I2.1.2|l using 
VBEM inference (Additive-VB) and the MCMC infer- 
ence (Additive-MCMC), projected additive models us- 
ing greedy projection pursuit of Section 12.1.31 (PPGPR- 
Greedy) and a variation of MCMC (PPGPR-MCMC). 
Finally, for the sparse GP method we used the sparse 
pseudo-input Gaussian process (SPGP) (18| . However, 
to be conservative, we did not learn the pseudo inputs 
(which can potentially greatly increase the algorithm 
complexity and runtime) but rather used a random 
subset of the inputs as the active set. For both the SPGP 
and the Full-GP, we used the GPML Matlab Code version 
3.1 [28 1. Also note that, for Additive-VB and PPGPR- 
greedy we have set the number of outer loop iterations 
(the number of VBEM iterations for the former, and the 
number of projections for the latter) to be at maximum 
10 for all AT. Increasing this number increased the cost 
with no change to accuracy, so this is a reasonable choice. 
All algorithms were run both as a single thread and 
using a parallel multicore, but since SPGP and and Full- 
GP do not offer efficient implementation of the parallel 
schemes, their results were the same for both casefl 

3.1.1 Synthetic Data Experiments 

First we used synthetic data generated by the following 
model: 



D 



fdO^(0;fc d (x d ,x^[l,l])) d=l 
e~7V(0,0.01), 



i=l,...,N, (41) 
,D, (42) 



2. When discussing parallel schemes we refer to only the learning 
stage. As in all GP frameworks, parallelism can always be used for 
prediction, since we are only interested in the predictive marginals 
per test point. However, this does not have any noticeable effect on 
the runtime and is thus unimportant to the comparison. 



where fc d (x£),x^; [1, 1]) is given by the Matern(7/2) ker- 
nel with unit lengthscale and amplitude. We used D = 8 
dimensions, and collected runtimes for a set of values 
for N ranging from 1000 to 50000. 

Figure |3] illustrates the significant computational sav- 
ings attained by exploiting the structure of the additive 
kernel. To find the relationship between the number of 
inputs to the runtime, we calculated a linear slope of 
the data in the log-log scale. As expected, the slope of 
the Full-GP is close to three due to its cubic complexity, 
and all the approximation algorithms have runtimes that 
scale linearly with the input size. We can also see that 
parallel processing of the state-space model matrices 
offers further improvement in scaling. These results 
serve only as a rough estimate, because the performance 
can depend on the chosen algorithm parameters, such 
as: number of outer loop iterations in the Additive-VB, 
number of projections in PPGPR-greedy or number of 
samples in the MCMC methods. This runtime/accuracy 
consideration should be used when comparing the effi- 
ciency of the algorithms. 



■ PPGPR-Greedy 
■PPGPR-MCMC 

Additive-VB 
■Additive-MCMC 

SPGP 
■Full-GP 




nm7168 



Fig. 3: A comparison of runtimes for efficient Bayesian 
additive GP regression, with D = 8, N = 
[2; 4; 6; 8; 10; 20; 30; 40; 50] x 10 3 , presented as a log-log 
plot. The algorithms ran on a Linux server, once as a 
single thread (dash lines) and once in a multicore parallel 
scheme using 8 processors (solid lines). At N=7168, 
we added an overlay of the runtime results for the 
pumadyn8-nm dataset (described in Section I3.1.2|l for 
both single ('x') and multicore ('o') runs. 

Additionally, runtime on a modern computer is by 
no means a perfect measure of algorithmic complexity. 
Nonetheless, we will see that the results of Fig. |3] agree 
with all the results from the real datasets. For example, 
in Fig. [3] we overlay the results of one of the real datasets, 
and one sees a close correspondence between synthetic 
and real data. Thus, these and subsequent results are 
highly representative and assert the primary point of this 
section: the runtime of our approximation algorithms do 
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indeed scale linearly with N, versus the cubic scaling of 
the naive GP implementation, not GP-grid. 

Fig. |4] shows the effects of increased dimensionality on 
the approximate algorithms. In this figure we show the 
runtime speedup of the algorithms with respect to the 
runtime of the Full-GP on the synthetic data generated 
with dimensionality of either D = 8 or D = 32. In all the 
runs the number of inputs was set to N = 8000, and the 
algorithms were run once with a single thread (1 worker 
= 1W), and once using the parallel scheme (8 workers = 
8W). In the multidimensional case, the projection pursuit 
algorithm exhibits the largest speedup, as it allows for 
a reduction in the number of effective dimensions (via 
the greedy selection). Notably, PPGPR-Greedy achieves 
consistently an order of magnitude improvement over 
SPGP. 
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Fig. 4: A comparison of the speed up offered by the 
approximation algorithms compared with exact GP. The 
runtime was measured on the learning stage for three 
approximation algorithms: sparse GP, Additive VB, and 
greedy Projection-Pursuit. The comparison was done 
using synthetic results with different dimensions (8D 
and 32D), and running on both a single and multicore 
(8-core) computer. As can be seen from the figure, the 
greedy Projection-Pursuit offers the highest speedup, 
and is especially efficient in high dimensions. 



3.1.2 Real Data Experiments 

Next, we extend the comparison to real datasets, which 
will allow thorough accuracy comparisons. We test 
over seven well-known datasets. These data sets are: 
synth-8D (N = 8000 synthetic data from Section |3XT] |. 
Next, the pumadyn family is a robotic arm dataset from 
121 , and consist of three datasets: pumadyn8-fml 
(N = 1000, fairly linear data with D = 8 dimensions), 
pumadyn8-fm7168 (N = 7168, fairly linear data with 
D = 8 dimensions), pumadyn32-nm (N = 7168, highly 
nonlinear data with D = 32). Elevators dataset con- 
sists of the current state of the fl6 aircraft (N = 8752, 
17-dimensional) |29], and kin40k is a highly nonlinear 



dataset (N = 10000, 8-dimensional) as introduced in 
|30J. Fig. [5] demonstrates the central analysis of this 
section. In each subplot, we calculate speedup, MNLP, 
and NMSE across all seven datasets and six algorithmic 
options. To reiterate, the two additive and two PPGPR 
algorithms are our advances of Section [27TT and our main 
comparison points are SPGP [18] and a naive full GP 
implementation. The top subplot in Fig. [5] indicates the 
substantial speedups offered by all algorithms over the 
full GP, with the exception only of the N = 1000 dataset 
(pumadyn8-f ml 0; this is not surprising given small 
N). Further, as indicated in Figure [3j our PPGPR-Greedy 
achieves the largest speedup across all datasets, and in 
most cases the error (MNLP and NMSE) is the same as 
competing methods. The first four or five datasets tell a 
very similar accuracy story across PPGPR-Greedy, SPGP, 
and the full GP. We also see that the simple additive 
models almost always underperform in accuracy, which 
is as expected given their limited expressivity compared 
to PPGPR-Greedy. The one exception where Additive- 
VB outperforms PPGPR-Greedy is the synthetic data set. 
However, this is expected as we used an additive model 
to generate data and the greedy nature of PPGPR-Greedy 
causes it to underperform. In the final two datasets, we 
see that SPGP and the full GP have considerably better 
accuracy. This may be explained as both these datasets 
are highly nonlinear, making the additive assumption 
inaccurate. 

Understanding the runtime-accuracy tradeoff based 
on problem requirements is essential. As we just de- 
scribed, PPGPR-greedy achieves the best runtimes but at 
times with an accuracy cost. Thus we want to quantify 
the notion of a runtime-accuracy tradeoff. To do so 
we plot all data sets and algorithms in a runtime vs. 
error plot (Fig. [6), and we use the economics concept 
of Pareto efficiency: efficient points in the runtime vs 
error plot represent algorithms with minimum runtime 
for a given error rate. Pareto inefficient algorithms are 
then those points that are unambiguously inferior. The 
efficient frontier is the convex hull of all {runtime,error} 
points (algorithms) for a given dataset. This will give us 
a clear picture of which algorithms are optimal choices 
across a range of datasets. Fig. [6] details this, with one 
efficient frontier for each dataset (a given color). Each 
algorithm has a given marker type. This immediately 
shows what one would expect: the full GP implementa- 
tion is typically most accurate, but only if one is willing 
to invest substantial runtime. This choice is often Pareto 
efficient. Secondly, most often the PPGPR-greedy is the 
other efficient choice for a substantially reduced runtime, 
albeit higher error. Surprising to note is the relative 
weakness of SPGP over several datasets. 

In Table [T] we count of the number of datasets where 
each algorithm is on the efficient frontier, which gives an 
idea of how often an algorithm is competitive with oth- 
ers, or optimal given a particular runtime or error bud- 
get. Three algorithms stand out in their overall efficiency: 
PPGPR-Greedy, SPGP, and full GP. The PPGPR-Greedy 
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is the only consistent efficient algorithm for all datasets as 
it achieves the fastest runtime. However, more interest- 
ingly, it also achieves very good accuracy results making 
most other algorithms inefficient. Of course, any trivial 
algorithm could achieve efficiency by having minimal 
runtime and arbitrary error, but the data demonstrates 
that this is not the case with our algorithms: the PPGPR- 
greedy error in almost all datasets is competitive or 
better than all alternatives. Thus the frequent efficiency 
of PPGPR-greedy is legitimate. 
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Fig. 5: These figures offer a comparison between the 
different GP methods discussed in the text, taking into 
account both speedup and accuracy. For comparison we 
used several known datasets from literature and ran 
the algorithms on a multicore (8-core) computer. The 
top figure illustrates the speedup of the approximation 
algorithms runtimes with respect to the full GP (exact 
inference) runtime. The bottom two figures show two 
metrics for calculating regression accuracy. 



TABLE 1: Efficiency comparison, showing the number of 
datasets where the algorithm was on the Pareto efficient 
frontier. There were seven datasets tested. 



Algorithm 


Pareto Efficient Frontier Count 


PPGPR-Greedy 


7 


PPGPR-MCMC 





Additive- VB 


1 


Additive-MCMC 





SPGP 


4 


Full-GP 


6 



3.2 Multidimensional Classification 

In this section we will compare the generalized additive- 
GP from Section 12.21 to other kernel classifiers (both 
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Fig. 6: The two fundamental desiderata of our algorithms 
are accuracy and speed. Here we plot error vs runtime to 
quantify the tradeoff between these two objectives using 
the notion of Pareto efficiency. Every algorithm is repre- 
sented using a unique marker and with a color scheme 
chosen according to the datasets. For each dataset, the 
Pareto efficient frontier is shown as a color line passing 
through the efficient algorithms for that dataset. 



Bayesian and non-Bayesian). We use common perfor- 
mance metrics from the sparse GP classification liter- 
ature, enabling straightforward comparison with other 
experimental results. In this paper we will focus on 
the task of binary classification, however in principle, 
extensions to tasks such as multi-class classification and 
Poisson regression can be performed without affecting 
asymptotic complexity. For performance measures we 
use: algorithm runtime (in seconds), error rate, and the 
test-set mean negative log-likelihood (MNLL): 



Error Rate 



# (incorrect classifications) 



(43) 



#(test cases) 

1 Nt 

MNLL = — _W logft + (1 " w) Ml " Pi)} • (44) 

* i=l 

For both the test error rate and MNLL measures lower 
values indicate better performance. 

3.2. 1 Synthetic Data Experiments 

We used synthetic data generated by the following 

model: 



Ui - Bernoulli^ ) i = 1, . . . , JV, 

Pi = g(fi), 

fO =!>(■), 

d 

f «_(■) ~ GP (0; fc d (x d , x^; e d )) d=l,...,D, 



(45) 
(46) 



where g{fi) is the logistic link function. 

Within the GP framework, we compared generalized 
additive GP Regression from Section 12.21 (Additive-LA), 
standard GP classification with Laplace's approximation 



(Full-GP) [2], and sparse GP methods of informative 
vector machine (IVM) |23| and fully independent con- 
ditional (FIC) (TJ. For completeness, we also include 
support vector machines (SVM) [31J0 For the Full-GP 
we used GPML Matlab Code version 3.1 [28]; for FIC 
we used the GPstuff Matlab package |55J; for SVM 
we used LIBSVM 1531 ; and for IVM we used the im- 
plementation given in |23l . We tested the algorithms 
on the synthetic data from the model above using 8 
dimensions while varying the number of inputs N = 
[2; 4; 6; 8; 10; 20; 30; 40; 50] x 10 3 . We stopped running the 
Full-GP at 10000 as it took too long to finish. A compar- 
ison of the runtime results is shown in Fig. [7] To be con- 
sistent, we used exactly 25 iterations for all algorithms 
during the learning stage. As can be seen from the figure, 
Additive-GP offers excellent scaling for large input sizes. 
The only algorithm that offers faster runtime than the 
additive-GP is IVM. This can be expected as the IVM 
only uses the information in the active set and discards 
the rest. Our algorithm, on the other hand, makes use of 
all the data, and is thus able to achieve a more accurate 
estimation, as the results below demonstrate. 




Input Size (N) 



Fig. 7: This figure shows the runtime of the classification 
algorithms for the synthetic dataset with D = 8, N = 
[2; 4; 6; 8; 10; 20; 30; 40; 50] x 10 3 . For the learning stage we 
used 50 iterations, and we did prediction on 1000 points. 
The log-log slopes of the algorithms are: Full-GP = 2.75, 
Additive-GP = 1.07, FIC = 1.53, IVM = 0.80, SVM = 
2.16. 



3.2.2 Real Data Experiments 

We tested the classification algorithms from the pre- 
vious section on three additional popular datasets: 
Breast Cancer [35 J, Magic Gamma Telescope |36|, 
and IJCNN [37]. We again only allowed 25 iterations in 
the learning stage. For sparse methods we tested two 
activeset sizes: 50, and O.liV. Table [2] summarizes the 

3. To calculate the MNLL, we used the probabilistic predictions from 
the SVM using cross-validation and cross-entropy metric 1321 
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classification results across all algorithms and datasets. 
Each column gives the classification error rate, MNLL, 
and runtime. First, we consider the runtime. We note 
that, as expected from Figure [7j the Full GP has the 
least attractive runtime in all but the Breast Cancer 
data (due to the small data size). The IVM has the 
best runtime performance across all datasets, after which 
our Additive-LA method is superior. In terms of per- 
formance, error rates are fairly consistent throughout 
the data sets, with the exception of notably high errors 
for IVM in the last two data sets. This is echoed by 
MNLL, where the IVM tends to have a significantly 
larger error. These results are not as compelling as in 
the regression case (as is often the case when comparing 
Bayesian methods to an SVM), but the Additive-LA 
is competitive overall. Thus, if a Bayesian method is 
needed for nonparametric classification, the Additive-LA 
approach is a viable and stable solution. 

TABLE 2: Performance Comparison of efficient Bayesian 
additive GP classification algorithms with commonly- 
used classification techniques on larger datasets. 



Algorithm 


Error Rate 


MNLL 


Runtime (s) 


Synthetic Additive Data (N 


= 4000, M = 


1000, D = 8) 


Full-GP 


0.6040 


0.7402 


2244.5111 


Additive-LA 


0.2800 


0.5929 


161.1185 


FIC - 50 


0.2800 


1.0640 


525.6684 


FIC - 400 


0.4550 


1.3612 


850.8427 


IVM - 50 


0.2800 


0.6931 


65.1951 


SVM 


0.2940 


0.5838 


345.7267 


Breast Cancer (N = 359, M = 90, D = 9) 


Full-GP 


0.0667 


0.1436 


6.0435 


Additive-LA 


0.0667 


0.1215 


89.2993 


FIC - 50 


0.0556 


0.0999 


27.9584 


FIC - 36 


0.0556 


0.0999 


55.9182 


IVM - 50 


0.0667 


0.6382 


12.4867 


SVM 


0.0556 


3.1717 


1.7062 


Magic Gamma Telescope (N - 


= 15216, M = 


3804, D = 10) 


Full-GP 


NA 


NA 


NA 


Additive-LA 


0.1393 


0.3419 


2345.6546 


FIC - 50 


0.1441 


0.3656 


3339.6185 


FIC - 1522 


0.1396 


0.3654 


7331.2780 


IVM - 50 


0.6583 


0.6932 


118.0407 


SVM 


0.1191 


0.3026 


8070.2400 


IJCNN (N = 49990, M = 91701, D = 13) 


Full-GP 


NA 


NA 


NA 


Additive-LA 


0.0516 


0.1560 


14505.5000 


FIC - 50 


0.0482 


1.1859 


4390.2498 


FIC - 4999 


0.0770 


0.8171 


16728.0911 


IVM - 50 


0.0950 


0.6932 


369.0000 


SVM 


0.0166 


0.0509 


22170.1000 



3.3 Multidimensional Regression on a Grid 

In this section we consider data with input on a multidi- 
mensional grid. We compare the exact GP-grid method 
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from Section 12.31 to the naive Full-GP method and show 
an application for this method in image reconstruction. 

We first start by comparing the runtime complexity of 
the GP-grid to Full-GP. For this we ran both algorithms 
on synthetic data where each synthetic dataset has input 
locations at the corners of the { — 1,1} hypercube in D 
dimensions. The target value is set to noise. At each run 
we increased the dimension D by 1, thereby multiplying 
the number of input points by 2. Fig. [8] illustrates the 
time it takes for one iteration during the learning stage. 
As can be seen, the GP-grid scales linearly with the input 
size while the Full-GP is cubic. 




Fig. 8: Runtimes of naive Full-GP in red and GP regres- 
sion on Grid using Kronecker product (GP-Grid from 
Sec. E3 in black, for N = [2 s ; 2 9 ; ... ; 2 20 ]. The slope for 
the naive Full-GP is 2.6 and that of Kronecker product 
is 1.05 (based on last 7 points). This empirically verifies 
the improvement to linear scaling. 

Improving the scalability of exact GP on a multi- 
dimensional grid may seem like an unusual special 
case, but it importantly enables new applications which 
were not tractable previously. One such application is 
picture reconstruction as pictures are an equispaced grid 
of pixels. Here we show how GP-grid can be used 
to reconstruct and interpolate a noisy image. We first 
apply GP-grid on a 200 x 200 pixel noisy image (Fig. 
[9at and use the inferred GP posterior as a denoised 
reconstruction (Fig. [9b). Note that the GP nicely smooths 
out most of the compression artifacts that were present 
in the original image. Note also that this is a GP on 
N = 40000, which is hopelessly intractable for Full-GP. 
However, GP-grid was able to learn the parameters in 
42.75 seconds, and denoise in 1490.3 seconds. Further, 
because the two methods are mathematically proven to 
be identical, any small numerical differences will be due 
to relative instability of the naive GP implementation. 

Next, we down-sampled the original image by j (Fig. 
|9cT l and used GP-grid to interpolate the missing values 
(Fig. I9d|> . Here the learning stage was 4.57 seconds and 



the reconstruction with interpolation was 186.61 seconds, 
and the overall quality of the interpolated reconstruction 
is still high. This image example is only a single example, 
but it is representative: due to the Kronecker method of 
Section 12.31 we have a provably exact GP method that 
scales linearly in the number of data points. Image and 
video analysis is a critical and common machine learn- 
ing application, but the use of nonparametric Bayesian 
algorithms in this domain is infrequent. Our GP-grid al- 
gorithm importantly enables the use of GP technologies 
in this application area. 




(c) (d) 

Fig. 9: Illustration of GP regression on a picture as an 
equidistance grid. Fig. [9a] shows a 200 x 200 pixels of 
the original noisy picture, in Fig. [9b] we used the GPR- 
grid method from Section 12.31 to learn the parameters 
and reconstruct the picture. Fig. [9c] is a down-sampled 
the original picture (Fig. 19a) by 2 in both dimensions (| 
of the data). Fig. I9dl shows a reconstruction of the pic- 
ture where the GPR-grid used the down-sampled data 
for learning, and then predicted the values at missing 
locations. 



4 Discussion and Conclusion 

Gaussian Processes are perhaps the most popular non- 
parametric Bayesian method in machine learning, but 
their adoption across other fields - and notably in appli- 
cation domains - has been limited by their burdensome 
scaling properties. Having fast, scalable methods for 
Gaussian Processes may mean the difference between a 
theoretically interesting approach and a method that is 
widely used in practice. 

While important sparsification work has somewhat 
addressed this scalability issue, the problem is by no 
means closed. Our aim here has been to explore the use 
of structured GP models. We made nontrivial advances 
to existing state-space and equispaced GP methods in 
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order to extend structured GP techniques into the multi- 
dimensional input domain. Our results (Section [3} illus- 
trate across a range of data and different algorithms that 
structured models are most often superior to the state of 
the art sparse methods (SPGP). Notably, we introduced 
projection pursuit Gaussian Process regression (PPGPR- 
greedy, Section [2.1. 3\ , an O(N) runtime algorithm that 
combines the computational efficiency of additive GP 
models (Section 12.1.1b with the expressivity of a mul- 
tidimensional coupled model. The result (Section 13.1.21 
notably Table Q} is an algorithm that has a superior 
runtime-accuracy tradeoff than several other algorithms 
including the sparse SPGP. While its accuracy was often 
slightly lower than a full GP, the linear scaling properties 
of PPGPR mean that it can be efficiently used across 
a much broader range of data sizes and applications. 
The primary takeaway of this work is thus: while the 
naive GP implementation may often produce the highest 
accuracy, the PPGPR algorithm that we introduced offers 
the best runtime-accuracy tradeoff across many datasets 
and is able to scale well beyond the realm of a naive GP. 

Of course, in some cases the researcher will prefer 
the SPGP method over PPGPR-greedy. Indeed, in many 
senses, these two approaches are orthogonal to each 
other. We see this as an inherent fact in approximation 
techniques: various methods will be more appropriate in 
different settings. Our results (Section [3) presented an in- 
depth investigation into this runtime-accuracy tradeoff, 
using both metrics on real datasets and meta-analyses 
of Pareto efficiency. Our well-founded and competitive 
alternatives for efficient GP regression and classification 
can thus enable the researcher to make an informed 
choice about a GP method for a given data size, data 
complexity, and available computational resource. 

To the point of runtime-accuracy tradeoff, there are 
sometimes opportunities for great scaling advantages 
with no accuracy tradeoff whatsoever. We demonstrated 
such an example with equispaced inputs in Section l2.1.3l 
Though this method exploits structure differently than 
the main PPGPR-greedy algorithm, our novel use of ten- 
sor algebra to create an O(N) GP model belongs in this 
exposition of the computational advantages of careful 
structural consideration. Notably, this method also opens 
up an entirely new set of big-data applications, such 
as image and video processing, or financial engineering 
applications such as implied volatility surfaces. Our 
future work is pursuing these application domains. 

As a last computational point, as growth in com- 
putational speed is increasingly driven by parallelism 
over raw processor speed, it will become increasingly 
important to use GP schemes that naturally incorporate 
parallel processing, to efficiently deal with the rapid 
growth of future datasets. Our PPGPR-greedy method 
stands out in this regard versus both the naive full GP 
and SPGP, and again the results of Section [3] reiterated 
this fact. 

Finally, from an algorithmic perspective, another in- 
teresting byproduct of this work was a number of sur- 



prising connections to classical statistical techniques. The 
additive model turned out to be a Bayesian interpreta- 
tion of the backfitting algorithm, importantly yielding 
an alternative proof of that algorithm's validity. We 
utilized another classical technique - projection pursuit - 
in the PPGPR model, which dramatically increased the 
expressivity of the additive model without sacrificing 
O(N) performance. 

Understanding how our existing nonparametric mod- 
els can scale and be used in real data, and how these 
models connect to other areas of statistics, will increase 
the utility of machine learning algorithms in general. 
This is perhaps most important with Gaussian Processes, 
which promise a wide range of useful applications. 
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